{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "bTry4ZMD2859"
   },
   "source": [
    "# What-If Tool Challenge Lab\n",
    "\n",
    "In this notebook, you will use mortgage data from NY in 2017 to create two binary classifiers to determine if a mortgage applicant will be granted a loan.\n",
    "\n",
    "You will train classifiers on two datasets. One will be trained on the complete dataset, and the other will be trained on a subset of the dataset, where 90% of the female applicants that were granted a loan were removed from the training data (so the dataset has 90% less females that were granted loans).\n",
    "\n",
    "You will then compare and examine the two models using the What-If Tool.\n",
    "\n",
    "In this notebook, you will be exepcted to:\n",
    "* Understand how the data is processed \n",
    "* Write TensorFlow code to build and train two models\n",
    "* Write code to deploy the the models to AI Platform\n",
    "* Examine the models in the What-If Tool"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zU9bzX-VWQCb"
   },
   "source": [
    "# Download and import the data\n",
    "\n",
    "Here, you'll import some modules and download some data from the Consumer Finance public [datasets](https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=ny&records=all-records&field_descriptions=labels)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "nhmYvLmUxSqU"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import collections\n",
    "from sklearn import preprocessing\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "from sklearn.utils import shuffle\n",
    "from witwidget.notebook.visualization import WitWidget, WitConfigBuilder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oVhFQBvggsio"
   },
   "outputs": [],
   "source": [
    "!wget https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_ny_all-records_labels.zip\n",
    "!unzip hmda_2017_ny_all-records_labels.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "uFyKHeHZD1e6"
   },
   "source": [
    "# Process the Data\n",
    "\n",
    "In this section, you **don't need to write any code**. We suggest you read through the cells to understand how the dataset is processed.\n",
    "\n",
    "Here, we start by importing the dataset into a Pandas dataframe. Then we process the data to exclude incomplete information and make a simple binary classification of loan approvals. We then create two datasets, one complete and one where 90% of female applicants are removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "LSsrdPdyCVYn"
   },
   "outputs": [],
   "source": [
    "# Set column dtypes for Pandas\n",
    "column_names = collections.OrderedDict({\n",
    "  'as_of_year': np.int16,\n",
    "  'agency_abbr': 'category',\n",
    "  'loan_type': 'category',\n",
    "  'property_type': 'category',\n",
    "  'loan_purpose': 'category',\n",
    "  'owner_occupancy': np.int8,\n",
    "  'loan_amt_000s': np.float64,\n",
    "  'preapproval': 'category',\n",
    "  'county_code': np.float64,\n",
    "  'applicant_income_00s': np.float64,\n",
    "  'purchaser_type': 'category',\n",
    "  'hoepa_status': 'category',\n",
    "  'lien_status': 'category',\n",
    "  'population': np.float64,\n",
    "  'ffiec_median_fam_income': np.float64,\n",
    "  'tract_to_msamd_income': np.float64,\n",
    "  'num_of_owner_occupied_units': np.float64,\n",
    "  'number_of_1_to_4_family_units': np.float64,\n",
    "  'approved': np.int8, \n",
    "  'applicant_race_name_3': 'category',\n",
    "  'applicant_race_name_4': 'category',\n",
    "  'applicant_race_name_5': 'category',\n",
    "  'co_applicant_race_name_3': 'category',\n",
    "  'co_applicant_race_name_4': 'category',\n",
    "  'co_applicant_race_name_5': 'category'\n",
    "})\n",
    "\n",
    "# Import the CSV into a dataframe\n",
    "data = pd.read_csv('hmda_2017_ny_all-records_labels.csv', dtype=column_names)\n",
    "data = shuffle(data, random_state=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5fMc5a2eY3Kh"
   },
   "source": [
    "## Extract columns and create dummy dataframes\n",
    "\n",
    "We first specify which columns to keep then drop the columns that don't have `loan originated` or `loan denied`, to make this a simple binary classification.\n",
    "\n",
    "We then create two dataframes `binary_df` and `bad_binary_df`. The first will include all the data, and the second will have 90% of female applicants removed, respectively. We then convert them into \"dummy\" dataframes to turn categorical string features into simple 0/1 features and normalize all the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qWNJwq2-Htxz"
   },
   "outputs": [],
   "source": [
    "# Only use a subset of the columns for these models\n",
    "text_columns_to_keep = [\n",
    "             'agency_name',\n",
    "             'loan_type_name',\n",
    "             'property_type_name',\n",
    "             'loan_purpose_name',\n",
    "             'owner_occupancy_name',\n",
    "             'applicant_ethnicity_name',\n",
    "             'applicant_race_name_1',\n",
    "             'applicant_sex_name',                      \n",
    "]\n",
    "numeric_columns_to_keep = [\n",
    "             'loan_amount_000s',\n",
    "             'applicant_income_000s',\n",
    "             'population',\n",
    "             'minority_population',\n",
    "             'hud_median_family_income' \n",
    "]\n",
    "\n",
    "columns_to_keep = text_columns_to_keep + numeric_columns_to_keep + ['action_taken_name']\n",
    "\n",
    "# Drop columns with incomplete information and drop columns that don't have loan orignated or denied, to make this a simple binary classification\n",
    "df = data[columns_to_keep].dropna()\n",
    "binary_df = df[df.action_taken_name.isin(['Loan originated', 'Application denied by financial institution'])].copy()\n",
    "binary_df.loc[:,'loan_granted'] = np.where(binary_df['action_taken_name'] == 'Loan originated', 1, 0)\n",
    "binary_df = binary_df.drop(columns=['action_taken_name'])\n",
    "\n",
    "# Drop 90% of loaned female applicants for a \"bad training data\" version\n",
    "loaned_females = (binary_df['applicant_sex_name'] == 'Female') & (binary_df['loan_granted'] == 1)\n",
    "bad_binary_df = binary_df.drop(binary_df[loaned_females].sample(frac=.9).index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ic6mWTvENrLd"
   },
   "outputs": [],
   "source": [
    "# Now lets' see the distribution of approved / denied classes (0: denied, 1: approved)\n",
    "print(binary_df['loan_granted'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "6h3kQmIqMLYr"
   },
   "outputs": [],
   "source": [
    "# Turn categorical string features into simple 0/1 features (like turning \"sex\" into \"sex_male\" and \"sex_female\")\n",
    "dummies_df = pd.get_dummies(binary_df, columns=text_columns_to_keep)\n",
    "dummies_df = dummies_df.sample(frac=1).reset_index(drop=True)\n",
    "\n",
    "bad_dummies_df = pd.get_dummies(bad_binary_df, columns=text_columns_to_keep)\n",
    "bad_dummies_df = bad_dummies_df.sample(frac=1).reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "3VfdY4PzWOoI"
   },
   "outputs": [],
   "source": [
    "# Normalize the numeric columns so that they all have the same scale to simplify modeling/training\n",
    "def normalize():\n",
    "  min_max_scaler = preprocessing.MinMaxScaler()\n",
    "  column_names_to_normalize = ['loan_amount_000s', 'applicant_income_000s', 'minority_population', 'hud_median_family_income', 'population']\n",
    "  x = dummies_df[column_names_to_normalize].values\n",
    "  x_scaled = min_max_scaler.fit_transform(x)\n",
    "  df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = dummies_df.index)\n",
    "  dummies_df[column_names_to_normalize] = df_temp\n",
    "\n",
    "  x = bad_dummies_df[column_names_to_normalize].values\n",
    "  x_scaled = min_max_scaler.fit_transform(x)\n",
    "  bad_df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = bad_dummies_df.index)\n",
    "  bad_dummies_df[column_names_to_normalize] = bad_df_temp\n",
    "\n",
    "normalize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "m20NBqsMaMkx"
   },
   "source": [
    "## Get the Train & Test Data\n",
    "\n",
    "Now, let's get the train and test data for our models.\n",
    "\n",
    "For the **first** model, you'll use `train_data` and `train_labels`.\n",
    "\n",
    "For the **second** model, you'll use `limited_train_data` and `limited_train_labels`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Np8JM4KINnKC"
   },
   "outputs": [],
   "source": [
    "# Get the training data & labels\n",
    "test_data_with_labels = dummies_df\n",
    "\n",
    "train_data = dummies_df\n",
    "train_labels = train_data['loan_granted']\n",
    "train_data = train_data.drop(columns=['loan_granted'])\n",
    "\n",
    "# Get the bad (limited) training data and labels\n",
    "limited_train_data = bad_dummies_df\n",
    "limited_train_labels = limited_train_data['loan_granted']\n",
    "limited_train_data = bad_dummies_df.drop(columns=['loan_granted'])\n",
    "\n",
    "# Split the data into train / test sets for Model 1\n",
    "x,y = train_data,train_labels\n",
    "train_data,test_data,train_labels,test_labels = train_test_split(x,y)\n",
    "\n",
    "# Split the bad data into train / test sets for Model 2\n",
    "lim_x,lim_y=limited_train_data,limited_train_labels\n",
    "limited_train_data,limited_test_data,limited_train_labels,limited_test_labels = train_test_split(lim_x,lim_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MyUxXszu0Mp0"
   },
   "source": [
    "# Create and train your TensorFlow models\n",
    "\n",
    "In this section, you will write code to train two TensorFlow Keras models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "K685pKOMUPQD"
   },
   "source": [
    "## Train your first model on the complete dataset.\n",
    "\n",
    "* **Important**: your first model should be saved in the location `saved_complete_model/saved_model.pb`.\n",
    "* The data will come from `train_data` and `train_labels`.\n",
    "\n",
    "If you get stuck, you can view the documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "PvgHgZ-agsi_"
   },
   "outputs": [],
   "source": [
    "# import TF modules\n",
    "from tensorflow.keras import layers\n",
    "from tensorflow.keras import initializers\n",
    "from tensorflow.keras import optimizers\n",
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.layers import Dense"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "l4qrBBr5bUSK"
   },
   "outputs": [],
   "source": [
    "# This is the size of the array you'll be feeding into our model for each example\n",
    "input_size = len(train_data.iloc[0])\n",
    "\n",
    "# Train the first model on the complete dataset. Use `train_data` for your data and `train_labels` for you labels.\n",
    "\n",
    "# ---- TODO ---------\n",
    "# create the model = Sequential()\n",
    "# model.add (your layers)\n",
    "# model.compile\n",
    "# model.fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CWGEGaxPgsjD"
   },
   "outputs": [],
   "source": [
    "# Save your model\n",
    "!mkdir -p saved_complete_model\n",
    "model.save('saved_complete_model') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hg0bnNVwgsjF"
   },
   "outputs": [],
   "source": [
    "# Get predictions on the test set and print the accuracy score (Model 1)\n",
    "y_pred = model.predict(test_data)\n",
    "acc = accuracy_score(test_labels, y_pred.round())\n",
    "print(\"Model 1 Accuracy: %.2f%%\" % (acc * 100.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "U2hPhuA-UXTT"
   },
   "source": [
    "## Train your second model on the limited datset.\n",
    "\n",
    "* **Important**: your second model should be saved in the location `saved_limited_model/saved_model.pb`.\n",
    "* The data will come from `limited_train_data` and `limited_train_labels`.\n",
    "\n",
    "\n",
    "If you get stuck, you can view the documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NP8cr7JvgsjH"
   },
   "outputs": [],
   "source": [
    "# Train your second model on the limited dataset. Use `limited_train_data` for your data and `limited_train_labels` for your labels.\n",
    "# Use the same input_size for the limited_model\n",
    "\n",
    "# ---- TODO ---------\n",
    "# create the limited_model = Sequential()\n",
    "# limited_model.add (your layers)\n",
    "# limited_model.compile\n",
    "# limited_model.fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "5UauXNlMgsjK"
   },
   "outputs": [],
   "source": [
    "# Save your model\n",
    "!mkdir -p saved_limited_model\n",
    "limited_model.save('saved_limited_model') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "n0UxiCcygsjM"
   },
   "outputs": [],
   "source": [
    "# Get predictions on the test set and print the accuracy score (Model 2)\n",
    "limited_y_pred = limited_model.predict(limited_test_data)\n",
    "acc = accuracy_score(limited_test_labels, limited_y_pred.round())\n",
    "print(\"Model 2 Accuracy: %.2f%%\" % (acc * 100.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-5X33HRf0b2C"
   },
   "source": [
    "# Deploy your models to the AI Platform\n",
    "\n",
    "In this section, you will first need to create a Cloud Storage bucket to store your models, then you will use gcloud commands to copy them over.\n",
    "\n",
    "You will then create two AI Platform model resources and their associated versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Jfp8H0esC6k_"
   },
   "outputs": [],
   "source": [
    "# ---- TODO ---------\n",
    "\n",
    "# Fill out this information:\n",
    "\n",
    "GCP_PROJECT = '#TODO'\n",
    "MODEL_BUCKET = 'gs:// #TODO'\n",
    "MODEL_NAME = 'saved_complete_model' #do not modify\n",
    "LIM_MODEL_NAME = 'saved_limited_model' #do not modify\n",
    "VERSION_NAME = 'v1'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NJOTCAsLDjcF"
   },
   "outputs": [],
   "source": [
    "# Copy your model files to Cloud Storage (these file paths are your 'origin' for the AI Platform Model)\n",
    "!gcloud storage cp --recursive ./saved_complete_model $MODEL_BUCKET\n",    "!gcloud storage cp --recursive ./saved_limited_model $MODEL_BUCKET"   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NJOTCAsLDjcF"
   },
   "outputs": [],
   "source": [
    "# Configure gcloud to use your project\n",
    "!gcloud config set project $GCP_PROJECT\n",
    "!gcloud config set ai_platform/region global"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "V1RF5Ga_HAva"
   },
   "source": [
    "## Create your first AI Platform model: *saved_complete_model*\n",
    "\n",
    "Navigate back to the Google Cloud Console to complete this task. See the lab guide for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TNCuzUbsKuUv"
   },
   "source": [
    "## Create your second AI Platform model: *saved_limited_model*\n",
    "\n",
    "Navigate back to the Google Cloud Console to complete this task. See the lab guide for details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4IZAJ1LrqUha"
   },
   "source": [
    "# Using the What-if Tool to interpret your model\n",
    "Once your models have deployed, you're now ready to connect them to the What-if Tool using the WitWidget. \n",
    "\n",
    "We've provided the Config Builder code and a couple of functions to get the class predictions from the models, which are necessary inputs for the WIT. If you've successfully deployed and saved your models, all you'll need to do is **add the WitConfigBuilder code in this cell**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "bQrAb7lbOhvI"
   },
   "outputs": [],
   "source": [
    "#@title Show model results in WIT\n",
    "num_datapoints = 1000  #@param {type: \"number\"}\n",
    "\n",
    "# Column indices to strip out from data from WIT before passing it to the model.\n",
    "columns_not_for_model_input = [\n",
    "    test_data_with_labels.columns.get_loc('loan_granted'),\n",
    "]\n",
    "\n",
    "# Return model predictions.\n",
    "def custom_predict(examples_to_infer):\n",
    "  # Delete columns not used by model\n",
    "  model_inputs = np.delete(\n",
    "      np.array(examples_to_infer), columns_not_for_model_input, axis=1).tolist()\n",
    "  # Get the class predictions from the model.\n",
    "  preds = model.predict(model_inputs)\n",
    "  preds = [[1 - pred[0], pred[0]] for pred in preds]\n",
    "  return preds\n",
    "  \n",
    "# Return 'limited' model predictions.\n",
    "def bad_custom_predict(examples_to_infer):\n",
    "  # Delete columns not used by model\n",
    "  model_inputs = np.delete(\n",
    "      np.array(examples_to_infer), columns_not_for_model_input, axis=1).tolist()\n",
    "  # Get the class predictions from the model.\n",
    "  preds = limited_model.predict(model_inputs)\n",
    "  preds = [[1 - pred[0], pred[0]] for pred in preds]\n",
    "  return preds\n",
    "\n",
    "examples_for_wit = test_data_with_labels.values.tolist()\n",
    "column_names = test_data_with_labels.columns.tolist()\n",
    "\n",
    "# ---- TODO ------ \n",
    "\n",
    "## Add WitConfigBuilder code here"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "machine_shape": "hm",
   "name": "what-if-tool-challenge.ipynb",
   "provenance": []
  },
  "environment": {
   "name": "tf2-2-3-gpu.2-3.m59",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-2-3-gpu.2-3:m59"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
